ABSTRACT

Highlighting similarities and differences between networks is an informative
task in investigating many biological processes. Typical examples are detecting
differences between an inferred network and the corresponding gold standard, or
evaluating changes in a dynamic network along time. Although fruitful insights
can be drawn by qualitative or feature-based methods, a distance must be used
whenever a quantitative assessment is required. Here we introduce the
Ipsen-Mikhailov metric for biological network comparison, based on the
difference of the distributions of the Laplacian eigenvalues of the compared
graphs. Being a spectral measure, its focus is on the general structure of the
net, so it can overcome the issues affecting local metrics such as the edit
distances. The relation with the classical Matthews Correlation Coefficient
(MCC) is discussed, showing the finer discriminant resolution achieved by the
Ipsen-Mikhailov metric. We conclude with three examples of application in
functional genomic tasks, including stability of network reconstruction as
robustness to data subsampling, variability in dynamical networks and
differences in networks associated to a classification task.

Key words: Network comparison, Network distance, Graph spectrum, Laplacian 
matrix. 



1 INTRODUCTION 



Network methods in biology have recently gained popularity among researchers
worldwide and they nowadays pervade a relevant portion of the scientific
literature: see (Pavlopoulos et al., 2011) for a recent review and (Buchanan et
al., 2010) for a comprehensive reference. Their role is believed to have an even
higher impact in the future: a good example is the case of network medicine
(Barabasi et al., 2011; Vidal et al., 2011). A central problem is the comparison
of biological networks, a task occurring in many areas of biology. Examples
include detecting similarities in gene networks related to the same pathway
across different species, tracking the evolution of the network wiring during a
biological process, or highlighting variations between networks associated to
different pathophysiological conditions. Classical comparison measures are the
pairs Precision/Recall or Sensitivity/Specificity, or the F-score (for instance
in network reconstruction), or the Maximal Common Subgraph distance (in network
alignment). More recently, the use of the Matthews Correlation Coefficient (MCC)
(The MicroArray Quality Control (MAQC) Consortium, 2010) has been borrowed from
the machine learning community as a more reliable indicator for summarizing the
confusion matrix into a single figure. Other cost-based functions stem from the
theory of graph matching: the edit distance and its variants use the minimum
cost of transformation of one graph into another by means of the usual edit
operations - insertion and deletion of links. These similarity measures are
widely considered in literature: see (Bunke, 2000) for an introductory review.
Alternatively, the theory of network measurements relies on the quantitative
description of main properties such as degree distribution and correlation, path
lengths, diameter, clustering, presence of motifs. However, all these measures
are local because, for each link, only the structure of its neighbourhood gives
a contribution to the distance value, while the structure of the whole topology
is not considered. To overcome the locality issue in network comparison, a few
global distances have been proposed: among them, the family of spectral measures
is particularly relevant. As the name suggests, their definition is based on
(functions of) the spectrum of one of the possible connectivity matrices of the
network, i.e. its set of eigenvalues. The spectral theory has been applied to
biological networks in (Banerjee and Jost, 2009), where the properties of being
scale-free^1 and small-world^2 are particularly evident. Isospectral networks
cannot be distinguished by this class of measures, so all these measures are
indeed distances between classes of isospectral graphs: however, the number of
isospectral networks is negligible for a large number of nodes (Haemers and
Spence, 2004). In a recent paper (Jurman et al., 2011), we described six
spectral distances, showing their behaviour on synthetic benchmarks and on the
transcriptional network of E. coli from the RegulonDB^3 database. On the ground
of such experiments, we choose the Ipsen-Mikhailov ε distance (Ipsen and
Mikhailov, 2002) out of the six original metrics for stability and robustness.
In (Barla et al., 2011) we show a complete functional genomic pipeline employing
the Ipsen-Mikhailov metric for the detection of the discriminant pathways after
a machine learning preprocessing. The ε metric evaluates the difference of the
distribution of Laplacian eigenvalues between two networks: as such, it can also
be interpreted as a measure of the different network synchronizability (Belykh
et al., 2005; Wu et al., 2008). Here we show the relation of the ε distance with
MCC, and we present examples of application for network comparison in situations
of biological interest such as evolving dynamical networks and the comparison of
miRNA networks associated to predictive discrimination in hepatocellular
carcinoma. Finally, we also show the use of the Ipsen-Mikhailov distance in
evaluating the stability of the reconstruction of a network from microarray data
in terms of robustness to data subsampling, in order to quantitatively express
the level of reliability of a given inference.

^1 Scale-freeness: the degree distribution follows a power law.

^2 Small-world nets: most nodes are not neighbors of one another, but most nodes
can be reached from every other by a small number of hops or steps.

^3 http://regulondb.ccg.unam.mx/



2 IPSEN-MIKHAILOV ε DISTANCE



Originally introduced in (Ipsen and Mikhailov, 2002) as a tool for network
reconstruction from its Laplacian spectrum, the definition of the
Ipsen-Mikhailov ε metric follows the dynamical interpretation of an $N$-node
network as an $N$-atom molecule connected by identical elastic strings, where
the pattern of connections is defined by the adjacency matrix $A$ of the
corresponding network. The dynamical system is described by the set of
differential equations

    $\ddot{x}_i + \sum_{j=0}^{N-1} A_{ij} (x_i - x_j) = 0$   for $i = 0, \ldots, N-1$.    (1)



We recall that the Laplacian matrix $L$ of an undirected network is defined as
the difference between the degree matrix $D$ and the adjacency matrix $A$,
$L = D - A$, where $D$ is the diagonal matrix with vertex degrees as entries.
$L$ is positive semidefinite and singular (Chung, 1997; Atay et al., 2006;
Spielman, 2009; Tonjes and Blasius, 2009), so its eigenvalues are
$0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{N-1}$. The vibrational
frequencies $\omega_i$ for the network model in Eq. (1) are given by the
eigenvalues of the Laplacian matrix of the network: $\lambda_i = \omega_i^2$,
with $\lambda_0 = \omega_0 = 0$. In (Chung, 1997), the Laplacian spectrum is
called the vibrational spectrum. Estimates (also asymptotic) of the eigenvalue
distribution are available for complex networks (Rodgers et al., 2005).
Moreover, the relations between the spectral properties and the structure and
the dynamics of a network are discussed in (Jost and Joy, 2002; Jost, 2007;
Almendral and Diaz-Guilera, 2007).
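The link between the elastic-string model and the Laplacian spectrum can be made explicit with the standard normal-mode argument. The following is only a sketch, assuming unit masses and unit spring constants; the symbols $d_i$ (degree of node $i$), $v$ (an eigenvector of $L$) and $\mathrm{i}$ (the imaginary unit) are introduced here for illustration.

```latex
\begin{align*}
\ddot{x}_i &= -\sum_{j} A_{ij}\,(x_i - x_j)
            = -\Big(d_i\,x_i - \sum_{j} A_{ij}\,x_j\Big)
            = -(Lx)_i
  &&\Longrightarrow\quad \ddot{x} + Lx = 0, \\
x(t) &= v\,e^{\mathrm{i}\omega t}, \quad Lv = \lambda v
  &&\Longrightarrow\quad (-\omega^2 + \lambda)\,v = 0
  \quad\Longrightarrow\quad \lambda = \omega^2 .
\end{align*}
```

Each Laplacian eigenvalue $\lambda_i$ thus corresponds to one oscillation mode with frequency $\omega_i = \sqrt{\lambda_i}$, and the zero eigenvalue corresponds to the rigid translation of the whole molecule.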

The spectral density for a graph as the sum of Lorentz distributions is defined
as

    $\rho(\omega) = K \sum_{i=1}^{N-1} \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2}$,

where $\gamma$ is the common width and $K$ is the normalization constant,
solution of $\int_0^\infty \rho(\omega)\, d\omega = 1$. The scale parameter
$\gamma$ specifies the half-width at half-maximum, which is equal to half the
interquartile range. It works as a multiplicative factor for the distance and,
in all experiments hereafter, $\gamma$ is set to 0.08 as in the original
reference.
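The normalization constant $K$ admits a closed form, since each Lorentz term can be integrated exactly on the half-line. The derivation below is a worked sketch that uses only the definition of the density as a sum of Lorentz distributions over the frequencies $\omega_1, \ldots, \omega_{N-1}$.

```latex
\begin{align*}
\int_0^\infty \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2}\, d\omega
  &= \left[\arctan\!\left(\frac{\omega - \omega_i}{\gamma}\right)\right]_0^\infty
   = \frac{\pi}{2} + \arctan\!\left(\frac{\omega_i}{\gamma}\right), \\
K &= \left[\,\sum_{i=1}^{N-1}
      \left(\frac{\pi}{2} + \arctan\!\left(\frac{\omega_i}{\gamma}\right)\right)\right]^{-1}.
\end{align*}
```

For a frequency $\omega_i \gg \gamma$ the corresponding term is close to $\pi$, while a frequency at zero contributes exactly $\pi/2$, because half of its Lorentz peak lies on the negative half-line.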

The spectral distance ε between two graphs $G$ and $H$ with densities
$\rho_G(\omega)$ and $\rho_H(\omega)$ can then be defined as

    $\epsilon(G, H) = \sqrt{\int_0^\infty \left[\rho_G(\omega) - \rho_H(\omega)\right]^2 d\omega}$.

Because of the definition of the Ipsen-Mikhailov distance, a comparison can be
computed only between networks with the same number of nodes. In order to get
rid of the intrinsic dependence of the distance on the number of nodes of the
compared networks, a normalization factor can be introduced, defined as the
distance between $E_n$ and $F_n$, respectively the totally disconnected and the
fully connected graph on $n$ nodes:

    $\hat{\epsilon}(G, H) = \frac{\epsilon(G, H)}{\epsilon(E_n, F_n)}$,

for $n$ the number of nodes of $G$ and $H$.
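The definitions of this section can be put together in a few lines of code. What follows is a minimal numerical sketch, not the implementation used for the experiments: the function names are hypothetical, the integral is approximated by the trapezoidal rule on a finite grid (the `margin` parameter and the grid step are arbitrary choices, so the slowly decaying tail is truncated), and graphs are assumed to be given as symmetric 0/1 adjacency matrices on at least two nodes.

```python
import numpy as np


def laplacian_frequencies(adj):
    """Frequencies omega_i = sqrt(lambda_i), i = 1..N-1, of L = D - A."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    eigvals = np.linalg.eigvalsh(lap)       # ascending; eigvals[0] is ~0
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    return np.sqrt(eigvals[1:])


def density_on_grid(freqs, grid, gamma):
    """Normalized sum of Lorentz distributions evaluated on the grid."""
    lorentz = gamma / ((grid[:, None] - freqs[None, :]) ** 2 + gamma ** 2)
    # Closed-form normalization over [0, infinity).
    k = 1.0 / np.sum(np.pi / 2 + np.arctan(freqs / gamma))
    return k * lorentz.sum(axis=1)


def ipsen_mikhailov(adj_g, adj_h, gamma=0.08, margin=10.0):
    """Approximate epsilon(G, H) for two graphs on the same node set."""
    w_g = laplacian_frequencies(adj_g)
    w_h = laplacian_frequencies(adj_h)
    upper = max(w_g.max(), w_h.max()) + margin   # truncation point (assumption)
    grid = np.arange(0.0, upper, gamma / 20)
    diff2 = (density_on_grid(w_g, grid, gamma)
             - density_on_grid(w_h, grid, gamma)) ** 2
    integral = np.sum((diff2[1:] + diff2[:-1]) * np.diff(grid)) / 2
    return float(np.sqrt(integral))


def normalized_ipsen_mikhailov(adj_g, adj_h, gamma=0.08):
    """epsilon(G, H) divided by epsilon(E_n, F_n)."""
    n = len(adj_g)
    empty = np.zeros((n, n))
    full = np.ones((n, n)) - np.eye(n)
    return (ipsen_mikhailov(adj_g, adj_h, gamma)
            / ipsen_mikhailov(empty, full, gamma))


path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
triangle = np.ones((3, 3)) - np.eye(3)
print(normalized_ipsen_mikhailov(path, triangle))
```

Since only the eigenvalues enter the computation, relabelling the nodes of either graph leaves the result unchanged, which is the isospectrality property recalled in the Introduction.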



3 RELATION WITH THE MATTHEWS CORRELATION COEFFICIENT



We first compare ε with the Matthews Correlation Coefficient (MCC for short), a
measure of common use in the machine learning community (Baldi et al., 2000) and
recently accepted as an effective metric also for network comparison (Supper et
al., 2007; Stokic et al., 2009). The MCC allows summarizing into a single value
the confusion matrix of a binary classification task, thus working as a reliable
alternative to measures obtained as functions of Sensitivity/Specificity and
Precision/Recall. Originally introduced in (Matthews, 1975), it is also known as
the $\phi$-coefficient, corresponding for a 2 × 2 contingency table to the
square root of the average $\chi^2$ statistic

    $\mathrm{MCC} = \sqrt{\chi^2 / N}$,

where $N$ is the total number of observations. As an example of use in
bioinformatics, MCC has been chosen as the reference metric in the US FDA-led
initiative MAQC-II, aimed at reaching consensus on the best practices for
development and validation of predictive models based on microarray gene
expression and genotyping data for personalized medicine (The MicroArray Quality
Control (MAQC) Consortium, 2010).

In the binary case of two classes positive (P) and negative (N), for the
confusion matrix $\begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix}$, where T and
F stand for true and false respectively, the Matthews Correlation Coefficient
has the following shape:

    $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$.

MCC lives in the range [-1, 1], where 1 is perfect classification, -1 is reached
in the complete misclassification case, while 0 corresponds to coin tossing
classification. Note that MCC is invariant for scalar multiplication of the
whole confusion matrix.
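As a small illustration of the formula and of its invariance under rescaling of the confusion matrix, the following sketch computes MCC from the four entries. The function name is hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the text.

```python
import math


def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient from the confusion matrix entries."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denom


print(mcc(10, 0, 0, 10))   # perfect classification
print(mcc(0, 10, 10, 0))   # complete misclassification
print(mcc(8, 2, 3, 7), mcc(80, 20, 30, 70))  # same value after rescaling
```

In network comparison, the two classes are "link present" and "link absent", so the four entries count the node pairs on which the two networks agree or disagree.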



We compare ε and MCC in two synthetic network experiments.

First we generate 1000 pairs of network topologies $(N_1, N_2)$ on $n = 100$
nodes as follows. The adjacency matrix for $N_1$ is randomly generated by
associating to each of the $\binom{n}{2} = \frac{n(n-1)}{2} = 4950$ possible
links a weight $w$ sampled from a uniform distribution in the unit interval: a
link is then declared existing whenever $w > 0.75$. The network $N_2$ is
generated by rewiring $N_1$ through deletion of $p_1$% of the existing links and
insertion of $p_2$% novel links, for $p_1$ and $p_2$ uniformly sampled in
[0, 90]. Then, for each pair $(N_1, N_2)$, we compute the MCC and the ε metrics:
the results are displayed in Fig. 1. The plot suggests that, although there is a
coherent trend between the two measures, the variability is quite high: the
(anti)correlation value for the two measures is 0.901.
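The generation of one pair of topologies for the first experiment can be sketched as follows. This is an illustrative reading of the procedure, not the original code: the function names are hypothetical, and the number of inserted links is taken here as a percentage of the number of existing links, which is an assumption since the text does not specify the reference quantity.

```python
import numpy as np


def random_topology(n, threshold=0.75, rng=None):
    """Symmetric 0/1 adjacency matrix: a link exists when its weight > threshold."""
    rng = np.random.default_rng(rng)
    weights = rng.uniform(size=(n, n))
    upper = np.triu(weights > threshold, k=1)   # one weight per node pair
    return (upper | upper.T).astype(int)


def rewire(adj, p_del, p_ins, rng=None):
    """Delete p_del% of the existing links and insert p_ins% novel ones."""
    rng = np.random.default_rng(rng)
    iu = np.triu_indices(adj.shape[0], k=1)
    links = adj[iu].copy()
    present = np.flatnonzero(links == 1)
    absent = np.flatnonzero(links == 0)
    n_del = int(round(p_del / 100 * present.size))
    # Assumption: the insertion percentage refers to the existing link count.
    n_ins = min(int(round(p_ins / 100 * present.size)), absent.size)
    links[rng.choice(present, size=n_del, replace=False)] = 0
    links[rng.choice(absent, size=n_ins, replace=False)] = 1
    out = np.zeros_like(adj)
    out[iu] = links
    return out + out.T


rng = np.random.default_rng(0)
n1 = random_topology(100, rng=rng)
p1, p2 = rng.uniform(0, 90, size=2)
n2 = rewire(n1, p1, p2, rng=rng)
print(n1.sum() // 2, n2.sum() // 2)   # number of links in each network
```

Novel links are drawn only among the node pairs that were unlinked in the first network, so a deleted link is never reinserted by the same rewiring step.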

The second experiment is aimed at quantifying the detected variability. A simple
network $N$ is created on 10 nodes with 20 links (of 45 potential) to be used as
the ground truth: its topology is displayed in Fig. 3(a). Then a set of 1000
networks $S = \{N_i\}_{i=1}^{1000}$ is created from the topology of $N$ by
randomly deleting 5 links (the total number of all such networks is
$\binom{20}{5} = 15504$). All elements of $S$ have confusion matrix
$\begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix} = \begin{pmatrix} 15 & 5 \\ 0 & 25 \end{pmatrix}$,
and thus for each $N_i \in S$,
$\mathrm{MCC} = \frac{15 \cdot 25 - 0 \cdot 5}{\sqrt{15 \cdot 20 \cdot 25 \cdot 30}} = \frac{\sqrt{10}}{4} \approx 0.79$.

For each $N_i$, the corresponding distance to the ground truth
$\epsilon(N_i, N)$ is computed: the histogram of the 1000 values of the
Ipsen-Mikhailov distance is shown in Fig. 2. As expected, the variability in the
obtained values for ε is very high: the range is [0.2670, 0.6438], with mean
0.4010 and median 0.3977. This result shows that the topologies of networks with
a given confusion matrix can be structurally very different. For instance, in
Fig. 3(b,c) we show the two networks $N_{\min}$, $N_{\max}$ associated to
extremal values of ε. Both these experiments support the claim that the
Ipsen-Mikhailov metric has a higher resolution in discriminating between network
structures.
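The reason why every network of the second experiment shares the same MCC can be checked directly: deleting 5 of the 20 true links always yields the same confusion matrix. The sketch below uses a randomly drawn ground truth with 20 links on 10 nodes, which is a stand-in for illustration and not the topology of Fig. 3(a); the helper names are hypothetical.

```python
import numpy as np


def confusion(truth, other):
    """Return (TP, FN, FP, TN) over the node pairs of two undirected graphs."""
    iu = np.triu_indices(truth.shape[0], k=1)
    t = truth[iu].astype(bool)
    o = other[iu].astype(bool)
    return (int(np.sum(t & o)), int(np.sum(t & ~o)),
            int(np.sum(~t & o)), int(np.sum(~t & ~o)))


def delete_links(truth, k, rng):
    """Copy of the ground truth with k randomly chosen links removed."""
    iu = np.triu_indices(truth.shape[0], k=1)
    links = truth[iu].copy()
    links[rng.choice(np.flatnonzero(links), size=k, replace=False)] = 0
    out = np.zeros_like(truth)
    out[iu] = links
    return out + out.T


rng = np.random.default_rng(42)
# Stand-in ground truth: 20 links chosen at random among the 45 node pairs.
pairs = np.triu_indices(10, k=1)
links = np.zeros(45, dtype=int)
links[rng.choice(45, size=20, replace=False)] = 1
truth = np.zeros((10, 10), dtype=int)
truth[pairs] = links
truth = truth + truth.T

sample = [delete_links(truth, 5, rng) for _ in range(1000)]
print({confusion(truth, net) for net in sample})
```

The set printed at the end contains a single confusion matrix, so any difference among the sampled networks is invisible to MCC and can only be detected by a structural measure.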
4 APPLICATIONS

Evolution of dynamic networks 



In (Kolar et al., 2010), the authors used the Keller algorithm to infer the gene
regulatory networks of Drosophila melanogaster from a time series of gene
expression data measured during its full life cycle. They selected 66 time
points during the developmental cycle, spanning across four different stages
(Embryonic - time points 1-30, Larval - t.p. 31-40, Pupal - t.p. 41-58, Adult -
t.p. 59-66), following the dynamics of 588 gene ontological groups and then
constructing a time series of inferred networks $N_i$.^4 Hereafter we evaluate
the structural differences between $N_i$ and $N_{i+1}$ and the distance between
$N_i$ and the initial network $N_1$, measured either by the Ipsen-Mikhailov
distance or by MCC: the resulting plots are displayed in Fig. 4. The largest
variations, both between consecutive terms and with respect to the initial
network $N_1$, occur in the embryonic stage (E). As expected, the variations
between consecutive terms (panels (a) and (c)) are smaller, while more relevant
changes occur when comparing a term with $N_1$. In particular, it is interesting
to note that the dynamics of the networks move $N_i$ away from $N_1$ until time
point 20, then the following terms start getting closer again to $N_1$ in terms
of Ipsen-Mikhailov distance. The same trend is captured by MCC, but with lower
resolution: the fact that the MCC curve is ascending from its minimum in the
last 15 time points can be appreciated only by zooming in from panel (d) to
panel (e). This means that, after the embryonic stage, the network is getting
structurally more and more similar again to $N_1$, but with a limited number of
links matching those of $N_1$.

^4 Adjacency matrices are available at http://cogito-b.ml.cmu.edu/keller/downloads.html
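The two kinds of curves discussed for the time series, namely distances between consecutive networks and distances from the initial network, can be obtained with the same helper whatever distance is plugged in. The sketch below is only illustrative: the function names are hypothetical, and the toy distance (number of links present in exactly one of the two networks) merely stands in for the Ipsen-Mikhailov distance or for MCC.

```python
def trajectory_distances(networks, distance):
    """Distances between consecutive networks and from the first network.

    networks: time-ordered sequence of networks.
    distance: any callable d(a, b) returning a number.
    """
    consecutive = [distance(a, b) for a, b in zip(networks, networks[1:])]
    from_first = [distance(networks[0], net) for net in networks]
    return consecutive, from_first


def link_difference(a, b):
    """Toy stand-in distance: links present in exactly one of the two networks."""
    return len(a ^ b)


# Three toy networks given as sets of links.
toy_series = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(0, 1), (2, 3)}),
    frozenset({(0, 1), (1, 2), (2, 3)}),
]
consecutive, from_first = trajectory_distances(toy_series, link_difference)
print(consecutive)
print(from_first)
```

In the toy series the last network is closer to the first one than the intermediate network is, which mimics on a small scale the return towards the initial network described above.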
Networks in profiling tasks



In the papers (Budhu et al., 2008; Ji et al., 2009), the authors introduced and
analyzed a dataset^5 collecting 482 tissue samples from 241 patients affected by
hepatocellular carcinoma (HCC). For each patient, a sample from cancerous
hepatic tissue and a sample from surrounding non-cancerous hepatic tissue have
been hybridized on the Ohio State University CCC MicroRNA Microarray Version 2.0
platform, consisting of 11520 probes collecting expressions of 250 non-redundant
human and 200 mouse microRNA (miRNA).

By the Machine Learning pipeline detailed in Tab. 1 we extract the top-20
optimal set of features discriminating cancer samples from controls. Most of
them are already known in literature as associated with hepatocellular
carcinoma.

The following phase consists in the construction of the six weighted miRNA
networks, one for each combination of patient subset (M+F, M, F) and tissue type
(T, nT), by using three different inference algorithms: WGCNA (Zhang and
Horvath, 2005; Zhao et al., 2010; Horvath, 2011), Aracne (Margolin et al., 2006)
and CLR (Faith et al., 2007), also considering their binarized versions after
thresholding. As an example, in Fig. 5 we show the correlation networks at
threshold 0.85 in all the six considered cases: the number of links in the
healthy tissue case is always larger than in the cancerous tissue case. Using
the Ipsen-Mikhailov distance, it is now possible to quantitatively explore the
similarity among the miRNA profile networks: in what follows, we show some
examples.
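The binarization step mentioned above can be sketched as a simple thresholding of a correlation matrix. The function name is hypothetical, and thresholding the absolute value of the correlation is an assumption made here for illustration, since the text does not state how negative correlations are treated.

```python
import numpy as np


def threshold_network(corr, threshold, binarize=True):
    """Keep the links whose absolute correlation reaches the threshold.

    Returns a 0/1 adjacency matrix, or the surviving weights when
    binarize is False. The input matrix is left untouched.
    """
    weights = np.abs(np.asarray(corr, dtype=float))   # new array, not a view
    np.fill_diagonal(weights, 0.0)                    # no self-loops
    keep = weights >= threshold
    return keep.astype(int) if binarize else np.where(keep, weights, 0.0)


corr = np.array([[1.0, 0.9, 0.2],
                 [0.9, 1.0, -0.86],
                 [0.2, -0.86, 1.0]])
for thr in (0.1, 0.5, 0.85, 0.9):
    adjacency = threshold_network(corr, thr)
    print(thr, adjacency.sum() // 2)   # number of links at this threshold
```

Raising the threshold can only remove links, so the link count is non-increasing along the sweep; this is the kind of sweep over thresholds between 0.1 and 0.9 that underlies a plot of distances as a function of the correlation threshold.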

For instance, in Fig. 6 we show how distances between four couples of
correlation networks evolve with the correlation threshold travelling between
0.1 and 0.9. The two closest networks are those corresponding to the Control
tissues, with a classwise related trend independent from gender. In fact, the
two curves expressing respectively the distance in the Tumorous tissue case
between Male and Female patients and the corresponding curve for the Control
tissue have a similar shape up to correlation threshold 0.8. Finally, for Female
patients, the Tumoral network is quite distant from the Control one,
highlighting a wider biological transformation caused by the disease than in
Male patients.

^5 Available at Gene Expression Omnibus (GEO), http://www.ncbi.nlm.nih.gov/geo/,
at the accession number GSE6857.

In Tab. 3 the Ipsen-Mikhailov distances are reported among all six weighted
networks for different methods, either on the whole set of 210 miRNA or on the
top-20 set of optimal features. The corresponding multidimensional scaling
projections are displayed in Fig. 7.
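A two-dimensional projection of a set of networks can be derived from their matrix of mutual distances. The sketch below implements classical (Torgerson) multidimensional scaling with NumPy only; which MDS variant was used for the figures is not stated in the text, so this choice is an assumption, and the function name is hypothetical.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Classical multidimensional scaling of a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering   # double centering
    vals, vecs = np.linalg.eigh(gram)                # ascending eigenvalues
    order = np.argsort(vals)[::-1][:dims]            # keep the largest ones
    scale = np.sqrt(np.clip(vals[order], 0.0, None))
    return vecs[:, order] * scale


# Toy check on three points of the plane with known mutual distances.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(dist, dims=2)
print(np.round(coords, 3))
```

Applied to a symmetric 6 x 6 matrix of Ipsen-Mikhailov distances, the function returns one point of the plane per network; the coordinates are defined only up to rotation and reflection, so only the relative positions are meaningful.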

The distances among networks show that there are substantial differences not
only between the Tumorous/NonTumorous tissue samples, but also between Male and
Female patients, both on the cancerous and the surrounding healthy tissue
relevance networks. In particular, it can be pointed out that the networks
corresponding to the tumoral tissue for female patients have a deeply different
structure with respect to all other networks, while the differences between the
models on all patients and those on the sole male population are smaller. This
may be an effect of the different number of male and female patients (210 versus
30) for WGCNA networks, but it is confirmed also by the Aracne algorithm, which
is less sensitive to sample size differences. Finally, CLR is the algorithm
where the differences between the networks built on the whole set of features
and those built on the top-20 subset are most relevant.

We can conclude with an analysis of distances across inference methods for
networks associated to a given sample subset, listed in Tab. 4. The structures
inferred by WGCNA and CLR are the closest when the full set of features is used,
but distances are small among methods. The situation radically changes when the
subset of optimal features is considered: in this case, WGCNA and Aracne tend to
build up very similar networks for all the sample subsets, while CLR is going
astray, confirming the observations drawn from the multidimensional scaling
plots of Fig. 7.



Subset stability in network reconstruction 

In this last example, we want to assess the stability of a network inferred from
high-throughput data in terms of distances between networks generated from data
subsampling. Sources of variability in this context are several: as a case
study, here we consider three different publicly available (on GEO) microarray
studies on the same pathology (colorectal cancer), on the same array platform
(Affymetrix Human Genome U133 Plus 2.0 Array), with the same inference algorithm
(WGCNA). References and details on the three datasets are listed in Tab. 5. The
genes of the 33-gene signature from the paper (Smith et al., 2010) (developed
for differentiating Dukes' stage A and D and tested on stages B and C) are
selected as the vertices of the subnetwork to infer. The 33 genes map on 85
probes of the platform; during analysis, the expression of a gene is computed by
averaging samplewise the expressions of all its mapping probes. The list of the
33 genes included in the signature, together with the mapped probes, is included
in Tab. 6. In Fig. 8 we show as an example three of the coexpression graphs (on
the whole set of data) for stages C and D for three different datasets. The node
numbering is taken from Tab. 6, the node size is proportional to its degree and
the edge width is proportional to its weight.

To quantify network stability, for a given dataset and stage, we select a random
fraction $p$ of the data and we generate the corresponding WGCNA network; this
procedure is repeated $N$ times, so as to end up with $N$ WGCNA networks for
each configuration (dataset, stage and $p$). Then all mutual
$\binom{N}{2} = \frac{N(N-1)}{2}$ Ipsen-Mikhailov distances are computed, and
for each set of graphs we build the corresponding distance histogram, reporting
also mean and standard deviation. The lower the mean and the variance, the more
stable the inferred network.
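The subsampling protocol can be sketched generically, with the inference algorithm and the network distance passed in as functions. In the sketch below all names are hypothetical, and the two toy callables (absolute Pearson correlation as inference, Frobenius norm as distance) are placeholders chosen only to make the example self-contained: they stand in for WGCNA and for the Ipsen-Mikhailov distance and do not reproduce them.

```python
from itertools import combinations

import numpy as np


def subsampling_stability(data, infer, distance, p, replicates, rng=None):
    """Mutual distances among networks inferred on random data subsets.

    data: array with one row per sample; p: fraction of samples kept;
    infer: callable mapping a data subset to a network;
    distance: callable d(a, b) between two networks.
    Returns the array of pairwise distances with its mean and variance.
    """
    rng = np.random.default_rng(rng)
    n_samples = data.shape[0]
    size = max(2, int(round(p * n_samples)))
    networks = []
    for _ in range(replicates):
        idx = rng.choice(n_samples, size=size, replace=False)
        networks.append(infer(data[idx]))
    dists = np.array([distance(a, b) for a, b in combinations(networks, 2)])
    return dists, dists.mean(), dists.var()


def toy_infer(subset):
    """Placeholder inference: absolute Pearson correlation between columns."""
    return np.abs(np.corrcoef(subset, rowvar=False))


def toy_distance(a, b):
    """Placeholder distance: Frobenius norm of the difference."""
    return float(np.linalg.norm(a - b))


rng = np.random.default_rng(1)
expression = rng.normal(size=(30, 5))   # synthetic data: 30 samples, 5 genes
dists, mean, var = subsampling_stability(
    expression, toy_infer, toy_distance, p=2 / 3, replicates=100, rng=rng)
print(dists.size, round(mean, 4), round(var, 4))
```

With 100 replicates the function returns 4950 pairwise distances per configuration; keeping the whole array, rather than only its mean and variance, allows the shape of the distribution to be inspected through a histogram.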



In Tab. 7 we show the results for $p = 2/3$ and $p = 1/2$, with $N = 100$
replicates. For stages A, B and C the best stability is detected on the GSE14333
dataset, while for stage D GSE17536/8 is the most stable dataset. Moreover, the
$(\mu, \sigma^2)$ couples listed in Tab. 7 do not show a great variability among
the 24 listed cases. A larger range of situations can be appreciated by looking
at the shapes of the distribution of each set of 4950 distances. In fact,
although some of the cases are almost Gaussian-like, a number of other
combinations of dataset and stage are represented by very skewed distributions:
for them, considering mean and variance as descriptive parameters may be
misleading. As an example, GSE14333, Stage D and GSE17536/8, Stage B for
$p = 1/2$ have rather similar mean and variance, but a quite different
distribution shape, as shown by Fig. 9. We can conclude by observing that the
plotted histograms show how in several cases the inferred network can be heavily
dependent on the particular chosen subset of data, leaving the network built on
the whole dataset affected by a relatively large level of uncertainty
(instability): this may be due both to high variability in the data distribution
and to high sensitivity of the algorithm to data perturbation. This fact should
always be taken into account when assessing the reliability of a reconstructed
network, in order to avoid drawing biological considerations from a possible
false positive edge linking two nodes.
5 DISCUSSION

The Ipsen-Mikhailov ε distance is an effective metric for comparing (biological)
networks in various situations. Its definition involves the distribution of the
Laplacian spectrum of the networks, so it deals with the structure of the
underlying graph, rather than focussing on the local pattern of the wiring
differences. It is mostly consistent with more classical measures such as MCC,
but it allows detection of finer differences. The presented examples show
effectiveness and usefulness of ε in different biological tasks, but additional
applications can be considered wherever a quantitative network comparison is
needed. Finally, the use of the Ipsen-Mikhailov distance in Transcription
Starting Sites networks will be presented within the Fantom5 initiative led by
the Riken Institute.
Figure 1: MCC versus ε distance for 1000 pairs of randomly generated topologies
on 100 nodes.
Mikhailov distance and (c) the network $N_{\max}$ with maximal distance from the
ground truth.

Figure 5: Relevance network at correlation 0.85 for (left) the tumoral samples (T) and
(right) the control tissues (nT), for the whole dataset (M+F), the male patients (M)
and the female patients (F), with the top-20 ranked features marked as red nodes.




Figure 6: Evolutions of distances between 4 couples of correlation networks as a func- 
tion of the correlation threshold. 



[Figure 7 panels: scatter plots of the six networks FT, MT, MFT, FN, MN and MFN,
in three rows titled "Weighted WGCNA networks", "Weighted Aracne networks" and
"Weighted CLR networks".]

Figure 7: Multidimensional scaling of the distances listed in Tab. 3 for the
complete set of miRNA (left panels) and the top-20 subset (right panels).
Figure 8: Examples of topology of weighted coexpression networks: (a) GSE14333,
Stage C (b) GSE14333, Stage D (c) GSE5206, Stage C (d) GSE17536/8, Stage C. Node
numbering inherited from Tab. 6, node size and edge width proportional to degree
and weight respectively.
Table 1: Workflow of the machine learning pipeline for the profiling tasks on the
$\mathcal{H}CC$ dataset.

1. Preprocessing phase: imputation of missing values (Troyanskaya et al., 2001)
   and discarding probes corresponding to non-human (mouse and controls) miRNA;

2. Obtaining a dataset $\mathcal{H}CC$ of 240+240 paired samples described by
   210 human miRNA;

3. Three profiling experiments: discriminate the two classes Tumoral (T) and non
   Tumoral (nT) within the whole set $\mathcal{H}CC$, in the subset
   $\mathcal{H}CC_M$ of the 210+210 samples belonging to male patients (M) and
   in the subset $\mathcal{H}CC_F$ of the 30+30 samples belonging to female (F)
   patients;

4. Data Analysis Protocol (DAP) as in (Budhu et al., 2008): 1000 × 10-fold Cross
   Validation;

5. Classifier: Spectral Regression Discriminant Analysis (SRDA) (Cai et al.,
   2008), α = 100; Feature Ranking: Entropy-based Recursive Feature Elimination
   (E-RFE) (Furlanello et al., 2003);

6. Performance: MCC averaged on the 1000 test sets for models with different
   number of features; confidence intervals are computed as 95% Student
   bootstrap;

7. Ranked list stability: the Canberra stability indicator I (Jurman et al.,
   2008), defined as the mean of mutual Canberra distances among the lists,
   normalized with respect to the whole set of possible permutations. The
   smaller the indicator value, the higher the stability level of the lists,
   with 0 corresponding to a set of 10000 identical lists and 1 to a set of
   randomly ranked lists;

8. Results: the model with 20 features is a reasonable compromise between
   classifier performance, list stability and small number of features: for the
   task with all samples, MCC = 0.845, CI = (0.839, 0.850), I = 0.166, while for
   the other two cases the analogous values are M: (0.931, (0.927, 0.934),
   0.323) and F: (0.859, (0.846, 0.871), 0.349);

9. Optimal list: for each of the three problems it is computed as the top-20
   sublist of the whole Borda list (Jurman et al., 2008; Borda, 1781);

10. In Tab. 2 we list the 16 miRNA common to at least two out of the three
    top-20 models: in particular, 7 miRNA are common to all the three problems.
Table 2: Common miRNA

$\mathcal{H}CC$, $\mathcal{H}CC_M$ and $\mathcal{H}CC_F$:
    hsa-mir-021-prec-17No1, hsa-mir-099-prec-21, hsa-mir-128b-precNo1,
    hsa-mir-21No1, hsa-mir-221-prec, hsa-mir-222-precNo1, hsa-mir-26a-1No2

$\mathcal{H}CC$ and $\mathcal{H}CC_M$:
    hsa-mir-122a-prec

$\mathcal{H}CC$ and $\mathcal{H}CC_F$:
    hsa-mir-100No1, hsa-mir-125b-1, hsa-mir-199b-precNo2, hsa-mir-219-1No2,
    hsa-mir-222-precNo2

$\mathcal{H}CC_M$ and $\mathcal{H}CC_F$:
    hsa-mir-130a-precNo2, hsa-mir-146-prec



Table 3: Ipsen-Mikhailov distances among all six networks for different inference meth- 
ods, on the whole set of 210 miRNA (upper triangular) or on the top-20 set of optimal 
features (lower triangular). 



Mutual distances for weighted WGCNA networks

          F T     M T     M+F T   F nT    M nT    M+F nT
F T       -       0.1440  0.1228  0.3538  0.3056  0.2929
M T       0.4091  -       0.0838  0.3498  0.2845  0.2742
M+F T     0.3996  0.1091  -       0.3587  0.3012  0.2871
F nT      0.7648  0.5980  0.6272  -       0.1659  0.1634
M nT      0.5998  0.3687  0.4176  0.4403  -       0.0500
M+F nT    0.6197  0.3962  0.4389  0.4344  0.0594  -

Mutual distances for weighted Aracne networks

          F T     M T     M+F T   F nT    M nT    M+F nT
F T       -       0.1764  0.1636  0.1219  0.1210  0.1179
M T       0.3162  -       0.0408  0.2358  0.1132  0.1299
M+F T     0.3237  0.1241  -       0.2280  0.1075  0.1271
F nT      0.3604  0.4183  0.4400  -       0.1685  0.1658
M nT      0.2934  0.2454  0.2926  0.2344  -       0.0570
M+F nT    0.2945  0.2960  0.3370  0.2120  0.1194  -

Mutual distances for weighted CLR networks

          F T     M T     M+F T   F nT    M nT    M+F nT
F T       -       0.0043  0.0037  0.0056  0.1194  0.1233
M T       0.3260  -       0.0008  0.0030  0.1158  0.1102
M+F T     0.2441  0.2506  -       0.0028  0.1123  0.1067
F nT      0.4223  0.3571  0.3788  -       0.0964  0.0017
M nT      0.3380  0.3669          0.3235  -       0.0011
M+F nT    0.3318  0.3577  0.3012  0.3251          -
Table 4: Ipsen-Mikhailov distances between weighted networks inferred from the same
sample subsets, for different inference methods WGCNA (W), Aracne (A) and CLR
(C), for the whole set of miRNA (upper half) and the optimal top-20 miRNA subset
(lower half).

Full set of 210 miRNA

          W,A     W,C     A,C
F T       0.6693  0.5732  0.7164
M T       0.6860  0.5648  0.7176
M+F T     0.6831  0.5637  0.7207
F nT      0.6821  0.5289  0.7012
M nT      0.6678  0.5360  0.7011
M+F nT    0.6650  0.5337  0.6998

Optimal subset of 20 miRNA

          W,A     W,C     A,C
F T       0.2537  0.9567  0.9365
M T       0.3557  0.8320  0.8984
M+F T     0.3807  0.8343  0.9103
F nT      0.5192  0.8335  0.8625
M nT      0.3707  0.8198  0.8562
M+F nT    0.3628  0.8192  0.8555
Table 5: Size of patient cohorts grouped by disease stage

                    Reference and GEO Accession Number
                    (Jorissen et al., 2009)  (Kaiser et al., 2007)  (Smith et al., 2010)
Dukes/AJCC Stage    GSE14333                 GSE5206                GSE17536/GSE17538
A/I                 44                       12                     28
B/II                94                       32                     72
C/III               91                       33                     76
D/IV                61                       21                     56
Table 6: The 33-gene CRC signature described in (Smith et al., 2010), with the mapped
probes on the Affymetrix Human Genome U133 Plus 2.0 Array platform.

No  Gene name  Ensembl ID       Mapped probes
1   ACTB       ENSG00000075624  200801_x_at 213867_x_at 224594_x_at AFFX-HSAC07/X00351_3_at
                                AFFX-HSAC07/X00351_5_at AFFX-HSAC07/X00351_M_at
2   DFNB31                      221887_s_at 47553_at
3   TMEM14A    ENSG00000096092  218477_at
4   CIRBP                       200810_s_at 200811_at 225191_at 228519_x_at 230142_s_at
5   SYT17      ENSG00000103528  205613_at 229053_at
6   AK1        ENSG00000106992  202587_s_at 202588_at
7   MGP        ENSG00000111341  202291_s_at 238481_at
8   VDR        ENSG00000111424  204253_s_at 204254_s_at 204255_s_at 213692_s_at
9   C6orf64    ENSG00000112167  218784_s_at 222741_s_at 232992_at
10  HES1       ENSG00000114315  203393_at 203394_s_at 203395_s_at
11  TEX11      ENSG00000120498  221259_s_at 233514_x_at 234296_s_at
12  MYOT       ENSG00000120729  219728_at
13  EGR1                        201693_s_at 201694_s_at 227404_s_at
14  DCTD       ENSG00000129187  201571_s_at 201572_x_at 210137_s_at
15  MMP13      ENSG00000137745  205959_at
16  TACC2      ENSG00000138162  1570025_at 1570546_a_at 202289_s_at 211382_s_at
17  CXCR7      ENSG00000144476  1559114_a_at 212977_at 232746_at
18  DENND2A    ENSG00000146966  221885_at 221886_at 53991_at
19  MUM1L1     ENSG00000157502  229160_at
20  PDLIM5     ENSG00000163110  203242_s_at 203243_s_at 211680_at 211681_s_at 212412_at
                                213684_s_at 216803_at 216804_s_at 221994_at 241208_at
21  SPDYA      ENSG00000163806  238262_at
22  NMNAT3     ENSG00000163864  228090_at 243738_at
23  CRABP1     ENSG00000166426  1563897_at 205350_at
24  ACYP2      ENSG00000170634  206833_s_at
25  CSN3       ENSG00000171209  207803_s_at
26  HPSE       ENSG00000173083  219403_s_at 222881_at
27  STOX2      ENSG00000173320  226822_at 231969_at 234317_s_at 234319_at
28  SLC25A30   ENSG00000174032  226782_at 238171_at
29  NQO1       ENSG00000181019  201467_s_at 201468_s_at 210519_s_at
30  SPRY4      ENSG00000187678  220983_s_at 221489_s_at
31  S100A3     ENSG00000188015  206027_at
32  PRTN3      ENSG00000196415  207341_at
33  HS3ST5     ENSG00000249853  240479_at
Table 7: Mean and variance for the sets of 4950 distances for all combinations of dataset
and stage, in the two cases of using 2/3 and 1/2 of the data.

Using 2/3 of the data

              Stage A           Stage B           Stage C           Stage D
              mean    variance  mean    variance  mean    variance  mean    variance
GSE14333      0.1988  0.0065    0.1787  0.0088    0.1843  0.0086    0.2447  0.0112
GSE5206       0.2457  0.0073    0.2180  0.0051    0.2777  0.0137    0.2409  0.0086
GSE17536/8    0.3306  0.0169    0.2301  0.0040    0.2199  0.0028    0.2173  0.0038

Using 1/2 of the data

              Stage A           Stage B           Stage C           Stage D
              mean    variance  mean    variance  mean    variance  mean    variance
GSE14333      0.2075  0.0066    0.2135  0.0112    0.2028  0.0104    0.2978  0.0159
GSE5206       0.2602  0.0070    0.2464  0.0071    0.2931  0.0146    0.2596  0.0095
GSE17536/8    0.3668  0.0176    0.2681  0.0048    0.2646  0.0044    0.2483  0.0061